Solving POMDP by On-Policy Linear Approximate Learning Algorithm
Abstract
This paper presents a fast Reinforcement Learning (RL) algorithm for solving Partially Observable Markov Decision Process (POMDP) problems. The proposed algorithm is devised to provide a policy-making framework for Network Management Systems (NMS), an engineering application for which no exact model is available. The algorithm consists of two phases. First, the model is estimated and a policy is learned in a completely observable simulator. Second, the estimated model is brought into the partially observable real world, where the learned policy is fine-tuned. The learning algorithm is based on on-policy linear gradient-descent learning with eligibility traces: the Q-value on the belief space is linearly approximated by the Q-values at the vertices of the belief space, to which an online TD method is applied. The proposed algorithm is tested against exact solutions on extensive small- and middle-size benchmark examples from the POMDP literature and found to be near optimal in terms of average discounted reward and steps-to-goal. The proposed algorithm significantly reduces convergence time and can easily be adapted to problems with large state counts.
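To make the belief-vertex approximation concrete, the sketch below implements on-policy SARSA(λ) with linear function approximation over the belief simplex, in the spirit of the algorithm described above: Q(b, a) is computed as the belief-weighted sum of vertex Q-values, and the TD update with eligibility traces is applied to those vertex values. The environment interface (reset/step returning a belief vector, reward, and a done flag) and all hyperparameter names are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

# Sketch of the on-policy linear TD idea described above:
# Q(b, a) ~ sum_s b(s) * Q(s, a), i.e. a linear combination of the
# Q-values at the vertices (corners) of the belief simplex, with
# SARSA(lambda) updating the vertex values. The env interface is an
# illustrative assumption.
def sarsa_lambda_belief(env, n_states, n_actions,
                        episodes=500, alpha=0.1, gamma=0.95,
                        lam=0.9, epsilon=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))   # Q-values at belief-simplex vertices

    def q_belief(b, a):
        return b @ Q[:, a]                # linear approximation on belief space

    def policy(b):                        # epsilon-greedy over approximated Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q_belief(b, a) for a in range(n_actions)]))

    for _ in range(episodes):
        b = env.reset()                   # initial belief over hidden states
        a = policy(b)
        e = np.zeros_like(Q)              # eligibility traces on vertex values
        done = False
        while not done:
            b2, r, done = env.step(a)     # env tracks the belief (Bayes filter)
            a2 = policy(b2)
            target = r if done else r + gamma * q_belief(b2, a2)
            delta = target - q_belief(b, a)
            # gradient of the linear form w.r.t. Q[:, a] is the belief b itself
            e[:, a] += b                  # accumulating traces
            Q += alpha * delta * e
            e *= gamma * lam
            b, a = b2, a2
    return Q
```

Because the features are the belief vector itself, this update reduces to ordinary tabular SARSA(λ) whenever the belief collapses to a single vertex, which corresponds to the fully observable simulator phase of the two-phase scheme.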
Similar resources
Model-Based Online Learning of POMDPs
Learning to act in an unknown partially observable domain is a difficult variant of the reinforcement learning paradigm. Research in the area has focused on model-free methods — methods that learn a policy without learning a model of the world. When sensor noise increases, model-free methods provide less accurate policies. The model-based approach — learning a POMDP model of the world, and comp...
A (Revised) Survey of Approximate Methods for Solving Partially Observable Markov Decision Processes
Partially observable Markov decision processes (POMDPs) are interesting because they provide a general framework for learning in the presence of multiple forms of uncertainty. We survey methods for learning within the POMDP framework. Because exact methods are intractable we concentrate on approximate methods. We explore two versions of the POMDP training problem: learning when a model of the P...
Dialogue POMDP components (Part II): learning the reward function
The partially observable Markov decision process (POMDP) framework has been applied in dialogue systems as a formal framework to represent uncertainty explicitly while being robust to noise. In this context, estimating the dialogue POMDP model components (states, observations, and reward) is a significant challenge as they have a direct impact on the optimized dialogue POMDP policy. Learning sta...
Monitoring plan execution in partially observable stochastic worlds
This thesis presents two novel algorithms for monitoring plan execution in stochastic partially observable environments. The problems can be naturally formulated as partially-observable Markov decision processes (POMDPs). Exact solutions of POMDP problems are difficult to find due to the computational complexity, so many approximate solutions are proposed instead. These POMDP solvers tend to ge...
On Partially Observable Markov Decision Processes Using Genetic Algorithm Based Q-Learning
As powerful probabilistic models for optimal policy search, partially observable Markov decision processes (POMDPs) still suffer from the problems such as hidden state and uncertainty in action effects. In this paper, a novel approximate algorithm Genetic algorithm based Q-Learning (GAQ-Learning), is proposed to solve the POMDP problems. In the proposed methodology, genetic algorithms maintain ...
Publication date: 1999